{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"../../img/ods_stickers.jpg\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 逻辑回归用于讽刺文本检测"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "cb01ca96934e5c83a36a2308da9645b87a9c52a0"
   },
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "本次挑战使用论文 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> A Large Self-Annotated Corpus for Sarcasm</i>](https://arxiv.org/abs/1704.05579) 提供的语料数据。该语料数据来源于 Reddit 论坛，挑战通过下面的链接下载并解压数据："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget -nc \"http://labfile.oss.aliyuncs.com/courses/1283/train-balanced-sarcasm.csv.zip\"\n",
    "!unzip -o \"train-balanced-sarcasm.csv.zip\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，导入挑战所需的必要模块。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "ffa03aec57ab6150f9bec0fa56cd3a5791a3e6f4"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "from matplotlib import pyplot as plt\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score, confusion_matrix\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.pipeline import Pipeline\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "然后，加载语料并预览。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "b23e4fc7a1973d60e0c6da8bd60f3d921542a856"
   },
   "outputs": [],
   "source": [
    "train_df = pd.read_csv(\"train-balanced-sarcasm.csv\")\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "查看数据集变量类别信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "0a7ed9557943806c6813ad59c3d5ebdb403ffd78"
   },
   "outputs": [],
   "source": [
    "train_df.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`comment` 的数量小于其他特征数量，说明存在缺失值。这里直接将这些缺失数据样本删除。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "97b2d85627fcde52a506dbdd55d4d6e4c87d3f08"
   },
   "outputs": [],
   "source": [
    "train_df.dropna(subset=[\"comment\"], inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "9d51637ee70dca7693737ad0da1dbb8c6ce9230b"
   },
   "source": [
    "输出数据标签，看一看类别是否平衡。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "addd77c640423d30fd146c8d3a012d3c14481e11"
   },
   "outputs": [],
   "source": [
    "train_df[\"label\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "5b836574e5093c5eb2e9063fefe1c8d198dcba79"
   },
   "source": [
    "最后，将数据切分为训练和测试集。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "c200add4e1dcbaa75164bbcc73b9c12ecb863c96"
   },
   "outputs": [],
   "source": [
    "train_texts, valid_texts, y_train, y_valid = train_test_split(\n",
    "    train_df[\"comment\"], train_df[\"label\"], random_state=17\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ba1a8f65032c5954476a68e01b607655145b746d"
   },
   "source": [
    "### 数据可视化探索"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，使用条形图可视化讽刺和正常文本长度，这里利用 `np.log1p` 对数据进行平滑处理，压缩到一定区间范围内。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "dadaf341602993a7854867a1df3004d0aa5d9b8c"
   },
   "outputs": [],
   "source": [
    "train_df.loc[train_df[\"label\"] == 1, \"comment\"].str.len().apply(np.log1p).hist(\n",
    "    label=\"sarcastic\", alpha=0.5\n",
    ")\n",
    "train_df.loc[train_df[\"label\"] == 0, \"comment\"].str.len().apply(np.log1p).hist(\n",
    "    label=\"normal\", alpha=0.5\n",
    ")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看的，二者在不同长度区间范围（横坐标）的计数分布比较均匀。接下来，挑战需要利用 WordCloud 绘制讽刺文本和正常文本关键词词云图。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>参考 [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> WordCloud 官方文档</i>](https://github.com/amueller/word_cloud) 绘制两类评论文本词云图，可自定义样式效果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install wordcloud  # 安装必要模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "词云非常好看，但往往看不出太多有效信息。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`subreddit` 表示评论归属于 Reddit 论坛子板块信息。下面，我们使用 `groupby` 来确定各子板块讽刺评论数量排序。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6ba84720d54144e054b6963d78b48bf648e5c652"
   },
   "outputs": [],
   "source": [
    "sub_df = train_df.groupby(\"subreddit\")[\"label\"].agg([np.size, np.mean, np.sum])\n",
    "sub_df.sort_values(by=\"sum\", ascending=False).head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "上面的代码中，`np.size` 可以计算出不同子板块评论的总数。由于讽刺评论的标签为 1，正常评论为 0，所以通过 `sum` 求和操作就可以直接求出讽刺评论的计数。同理，`mean` 即代表讽刺评论所占比例。这是一个分析处理小技巧。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>沿用以上数据，输出子板块评论数大于 1000 且讽刺评论比例排名前 10 的信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "同理，可以从用户的维度去分析讽刺评论的比例分布。下面就需要分析得出不同用户 `author` 发表评论的数量、讽刺评论的数量及比例。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>输出发表评论总数大于 300，且讽刺评论比例最高的 10 位用户信息。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "416321f19f5a27290bc5622e8b3384b7bbbd28c6"
   },
   "source": [
    "### 训练分类模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，我们训练讽刺评论分类预测模型。这里，我们使用 tf-idf 提取文本特征，并建立逻辑回归模型。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "3048a070a56b08eb4e5fe2c54b6d14905031e74a"
   },
   "outputs": [],
   "source": [
    "# 使用 tf-idf 提取文本特征\n",
    "tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)\n",
    "# 建立逻辑回归模型\n",
    "logit = LogisticRegression(C=1, n_jobs=4, solver=\"lbfgs\", random_state=17, verbose=1)\n",
    "# 使用 sklearn pipeline 封装 2 个步骤\n",
    "tfidf_logit_pipeline = Pipeline([(\"tf_idf\", tf_idf), (\"logit\", logit)])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "下面就可以开始训练模型了。由于数据量较大，代码执行时间较长，请耐心等待。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>训练讽刺文本分类预测模型，并得到测试集上的准确度评估结果。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "fefd0178f43ce832031653be70f0a0e47f62cf4c"
   },
   "source": [
    "### 模型解释"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，挑战构建一个混淆矩阵的函数 `plot_confusion_matrix`。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "247a13fd3ae4d5c015c0ca0489a9a95d72ad7e9f"
   },
   "outputs": [],
   "source": [
    "def plot_confusion_matrix(\n",
    "    actual,\n",
    "    predicted,\n",
    "    classes,\n",
    "    normalize=False,\n",
    "    title=\"Confusion matrix\",\n",
    "    figsize=(7, 7),\n",
    "    cmap=plt.cm.Blues,\n",
    "    path_to_save_fig=None,\n",
    "):\n",
    "    \"\"\"\n",
    "    This function prints and plots the confusion matrix.\n",
    "    Normalization can be applied by setting `normalize=True`.\n",
    "    \"\"\"\n",
    "    import itertools\n",
    "\n",
    "    from sklearn.metrics import confusion_matrix\n",
    "\n",
    "    cm = confusion_matrix(actual, predicted).T\n",
    "    if normalize:\n",
    "        cm = cm.astype(\"float\") / cm.sum(axis=1)[:, np.newaxis]\n",
    "\n",
    "    plt.figure(figsize=figsize)\n",
    "    plt.imshow(cm, interpolation=\"nearest\", cmap=cmap)\n",
    "    plt.title(title)\n",
    "    plt.colorbar()\n",
    "    tick_marks = np.arange(len(classes))\n",
    "    plt.xticks(tick_marks, classes, rotation=90)\n",
    "    plt.yticks(tick_marks, classes)\n",
    "\n",
    "    fmt = \".2f\" if normalize else \"d\"\n",
    "    thresh = cm.max() / 2.0\n",
    "    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):\n",
    "        plt.text(\n",
    "            j,\n",
    "            i,\n",
    "            format(cm[i, j], fmt),\n",
    "            horizontalalignment=\"center\",\n",
    "            color=\"white\" if cm[i, j] > thresh else \"black\",\n",
    "        )\n",
    "\n",
    "    plt.tight_layout()\n",
    "    plt.ylabel(\"Predicted label\")\n",
    "    plt.xlabel(\"True label\")\n",
    "\n",
    "    if path_to_save_fig:\n",
    "        plt.savefig(path_to_save_fig, dpi=300, bbox_inches=\"tight\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ff4a79b0368176a518fb0b84b45a508499e6183f"
   },
   "source": [
    "应用 `plot_confusion_matrix` 绘制出测试数据原始标签和预测标签类别的混淆矩阵。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "6df0c058a45b48b756e57e01a23bbc0974407195"
   },
   "outputs": [],
   "source": [
    "plot_confusion_matrix(\n",
    "    y_valid,\n",
    "    valid_pred,\n",
    "    tfidf_logit_pipeline.named_steps[\"logit\"].classes_,\n",
    "    figsize=(8, 8),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "6af3a7c93afef23ce9d215bf1daa2c91feb57d5d"
   },
   "source": [
    "实际上，这里利用 `eli5` 可以输出分类器在预测判定是文本特征的权重。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install eli5  # 安装必要模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "f62f3043b6e94fb6bbd5683a0e9662c572847fa6"
   },
   "outputs": [],
   "source": [
    "import eli5\n",
    "\n",
    "eli5.show_weights(\n",
    "    estimator=tfidf_logit_pipeline.named_steps[\"logit\"],\n",
    "    vec=tfidf_logit_pipeline.named_steps[\"tf_idf\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们可以发现，讽刺评论通常都喜欢使用 yes, clearly 等带有肯定意味的词句。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "be94f2065f86c65d5ee5590c9b2e5a541135732c"
   },
   "source": [
    "<img src=\"https://doc.shiyanlou.com/courses/uid214893-20190505-1557034785375\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "5648f6ad7a14ef3a582909f7c0c72c4fc80204aa"
   },
   "source": [
    "### 模型改进"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，我们期望模型能得到进一步改进，所以再补充一个 `subreddit` 特征，同样完成切分。注意，这里切分时一定要选择同一个 `random_state`，保证能和上面的评论数据对齐。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "aaefd2eb6f829f9eb9e9cd12c7903d3086182acc"
   },
   "outputs": [],
   "source": [
    "subreddits = train_df[\"subreddit\"]\n",
    "train_subreddits, valid_subreddits = train_test_split(subreddits, random_state=17)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来，同样使用 tf-idf 算法分别构建 2 个 `TfidfVectorizer` 用于 `comment` 和 `subreddits` 的特征提取。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "88be690ae260d824fbd8df4a4d02e5abcce0d5a7"
   },
   "outputs": [],
   "source": [
    "tf_idf_texts = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)\n",
    "tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>使用构建好的 `TfidfVectorizer` 完成特征提取。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "74b044c4b73653181ddde0f32e65a14d4867319c"
   },
   "source": [
    "然后，将提取出来的特征拼接在一起。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "bc458646e3c36792cf845f86cb9b6d6cc384b32c"
   },
   "outputs": [],
   "source": [
    "from scipy.sparse import hstack\n",
    "\n",
    "X_train = hstack([X_train_texts, X_train_subreddits])\n",
    "X_valid = hstack([X_valid_texts, X_valid_subreddits])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "d5d2505567c9303b2ce9f81c9bdc11fc799af91e"
   },
   "outputs": [],
   "source": [
    "X_train.shape, X_valid.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "a7f2040b76acbf2ff58e5dc68d7437ee6c3e9989"
   },
   "source": [
    "最后，同样使用逻辑回归进行建模和预测。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-question-circle\" aria-hidden=\"true\"> 问题：</i>使用新特征训练逻辑回归分类模型并得到测试集上的分类准确度。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "5fe6b371e91831198dda10d8b198c9e1ebfe07aa"
   },
   "source": [
    "不出意外的话，准确度会更高一些。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i class=\"fa fa-link\" aria-hidden=\"true\"> 相关链接</i>\n",
    "- [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> Machine learning library Scikit-learn</i>](https://scikit-learn.org/stable/index.html)\n",
    "- [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> Kernels on logistic regression</i>](https://www.kaggle.com/kashnitsky/topic-4-linear-models-part-2-classification)\n",
    "- [<i class=\"fa fa-external-link-square\" aria-hidden=\"true\"> ELI5 to explain model predictions</i>](https://github.com/TeamHG-Memex/eli5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div style=\"background-color: #e6e6e6; margin-bottom: 10px; padding: 1%; border: 1px solid #ccc; border-radius: 6px;text-align: center;\"><a href=\"https://nbviewer.jupyter.org/github/shiyanlou/mlcourse-answers/tree/master/\" title=\"挑战参考答案\"><i class=\"fa fa-file-code-o\" aria-hidden=\"true\"> 查看挑战参考答案</i></a></div>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
